Name: Anuj Subedi
Student Id: 004009141
Importing Libraries and Data Loading¶
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Loading data
df = pd.read_csv('global_housing_market_extended.csv')
df
| Country | Year | House Price Index | Rent Index | Affordability Ratio | Mortgage Rate (%) | Inflation Rate (%) | GDP Growth (%) | Population Growth (%) | Urbanization Rate (%) | Construction Index | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | 2015 | 117.454012 | 116.550001 | 9.587945 | 4.493292 | 1.514121 | -0.752044 | -0.796707 | 85.985284 | 118.089201 |
| 1 | USA | 2016 | 150.807258 | 51.440915 | 11.729189 | 5.662213 | 1.880204 | -0.545400 | -0.358084 | 69.127267 | 111.980515 |
| 2 | USA | 2017 | 123.194502 | 70.386040 | 8.506676 | 2.197469 | 2.398940 | 0.930895 | 0.596245 | 83.555279 | 85.973903 |
| 3 | USA | 2018 | 131.423444 | 91.469020 | 3.418054 | 4.537724 | 1.608407 | -1.479587 | 2.321099 | 88.968961 | 134.671788 |
| 4 | USA | 2019 | 110.461377 | 56.837048 | 9.158097 | 3.700762 | 1.293249 | 1.961415 | -0.879640 | 87.279612 | 90.702399 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | UAE | 2020 | 160.153711 | 117.170365 | 7.390686 | 2.048681 | 4.061737 | 1.635019 | 1.955250 | 62.942477 | 109.059291 |
| 196 | UAE | 2021 | 95.004866 | 72.727315 | 9.636214 | 3.880091 | 2.943274 | 1.155810 | 0.608064 | 83.550496 | 141.366775 |
| 197 | UAE | 2022 | 175.533469 | 105.083237 | 5.838662 | 4.940674 | 3.344420 | 0.037365 | 1.943051 | 61.152790 | 142.140959 |
| 198 | UAE | 2023 | 126.147746 | 94.604103 | 8.934185 | 5.975589 | 4.638353 | 2.911469 | -0.766718 | 75.552241 | 82.013520 |
| 199 | UAE | 2024 | 153.743377 | 85.855534 | 9.122050 | 1.708365 | 1.051148 | 3.730587 | -0.747705 | 62.137702 | 70.968678 |
200 rows × 11 columns
Data Understanding and Cleaning¶
Understanding the dataset¶
# Checking first few rows
print("First five rows of the dataset:")
df.head()
First five rows of the dataset:
| Country | Year | House Price Index | Rent Index | Affordability Ratio | Mortgage Rate (%) | Inflation Rate (%) | GDP Growth (%) | Population Growth (%) | Urbanization Rate (%) | Construction Index | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | 2015 | 117.454012 | 116.550001 | 9.587945 | 4.493292 | 1.514121 | -0.752044 | -0.796707 | 85.985284 | 118.089201 |
| 1 | USA | 2016 | 150.807258 | 51.440915 | 11.729189 | 5.662213 | 1.880204 | -0.545400 | -0.358084 | 69.127267 | 111.980515 |
| 2 | USA | 2017 | 123.194502 | 70.386040 | 8.506676 | 2.197469 | 2.398940 | 0.930895 | 0.596245 | 83.555279 | 85.973903 |
| 3 | USA | 2018 | 131.423444 | 91.469020 | 3.418054 | 4.537724 | 1.608407 | -1.479587 | 2.321099 | 88.968961 | 134.671788 |
| 4 | USA | 2019 | 110.461377 | 56.837048 | 9.158097 | 3.700762 | 1.293249 | 1.961415 | -0.879640 | 87.279612 | 90.702399 |
# Checking last few rows
print("Last five rows of the dataset:\n")
df.tail()
Last five rows of the dataset:
| Country | Year | House Price Index | Rent Index | Affordability Ratio | Mortgage Rate (%) | Inflation Rate (%) | GDP Growth (%) | Population Growth (%) | Urbanization Rate (%) | Construction Index | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 195 | UAE | 2020 | 160.153711 | 117.170365 | 7.390686 | 2.048681 | 4.061737 | 1.635019 | 1.955250 | 62.942477 | 109.059291 |
| 196 | UAE | 2021 | 95.004866 | 72.727315 | 9.636214 | 3.880091 | 2.943274 | 1.155810 | 0.608064 | 83.550496 | 141.366775 |
| 197 | UAE | 2022 | 175.533469 | 105.083237 | 5.838662 | 4.940674 | 3.344420 | 0.037365 | 1.943051 | 61.152790 | 142.140959 |
| 198 | UAE | 2023 | 126.147746 | 94.604103 | 8.934185 | 5.975589 | 4.638353 | 2.911469 | -0.766718 | 75.552241 | 82.013520 |
| 199 | UAE | 2024 | 153.743377 | 85.855534 | 9.122050 | 1.708365 | 1.051148 | 3.730587 | -0.747705 | 62.137702 | 70.968678 |
# Number of rows and columns
rows,columns = df.shape
print(f"The number of rows = {rows}")
print(f"The number of columns = {columns}")
The number of rows = 200 The number of columns = 11
# Column names
print("The column names are:")
for i in df.columns:
print(f" - {i}")
The column names are: - Country - Year - House Price Index - Rent Index - Affordability Ratio - Mortgage Rate (%) - Inflation Rate (%) - GDP Growth (%) - Population Growth (%) - Urbanization Rate (%) - Construction Index
# numerical columns
num_cols = df.select_dtypes(include=['number']).columns
# categorical columns
cat_cols = df.select_dtypes(include=['object', 'category']).columns
print("Numerical columns:")
for i in num_cols:
print(i)
print("\nCategorical columns:")
for j in cat_cols:
print(j)
Numerical columns: Year House Price Index Rent Index Affordability Ratio Mortgage Rate (%) Inflation Rate (%) GDP Growth (%) Population Growth (%) Urbanization Rate (%) Construction Index Categorical columns: Country
The dataset contains one categorical column and ten numerical columns.
- Country: The nation represented in the row.
- Year: The calendar year of the record (2015–2024).
- House Price Index: A standardized metric tracking the change in residential property prices over time.
- Rent Index: A standardized metric tracking the change in rental prices over time.
- Affordability Ratio: A calculated measure indicating how affordable housing is relative to the average income in that country.
- Mortgage Rate (%): The average annual interest rate charged on loans for purchasing real estate.
- Inflation Rate (%): The percentage increase in the cost of general goods and services in the economy.
- GDP Growth (%): The annual percentage change in Gross Domestic Product, representing the country's economic expansion or contraction.
- Population Growth (%): The annual percentage change in the total population.
- Urbanization Rate (%): The percentage of the total population residing in urban (city) areas.
- Construction Index: A measure of activity in the construction sector, often indicating housing supply growth.
Cleaning data¶
# Check for missing data
print(f"The number of missing values: \n{df.isnull().sum()}")
The number of missing values: Country 0 Year 0 House Price Index 0 Rent Index 0 Affordability Ratio 0 Mortgage Rate (%) 0 Inflation Rate (%) 0 GDP Growth (%) 0 Population Growth (%) 0 Urbanization Rate (%) 0 Construction Index 0 dtype: int64
There are no columns with missing values. Thus, we don't have to worry about this during data analysis.
We copy the data frame into another before cleaning the data:
# Copying the data
df_new = df.copy()
# Standardizing the columns
df_new.rename(columns={
'House Price Index': 'Price_Index',
'Rent Index': 'Rent_Index',
'Affordability Ratio': 'Affordability_Ratio',
'Mortgage Rate (%)': 'Mortgage_Rate',
'Inflation Rate (%)': 'Inflation',
'GDP Growth (%)': 'GDP_Growth',
'Population Growth (%)': 'Pop_Growth',
'Urbanization Rate (%)': 'Urbanization_Rate',
'Construction Index': 'Construction_Index'
}, inplace=True)
df_new
| Country | Year | Price_Index | Rent_Index | Affordability_Ratio | Mortgage_Rate | Inflation | GDP_Growth | Pop_Growth | Urbanization_Rate | Construction_Index | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | 2015 | 117.454012 | 116.550001 | 9.587945 | 4.493292 | 1.514121 | -0.752044 | -0.796707 | 85.985284 | 118.089201 |
| 1 | USA | 2016 | 150.807258 | 51.440915 | 11.729189 | 5.662213 | 1.880204 | -0.545400 | -0.358084 | 69.127267 | 111.980515 |
| 2 | USA | 2017 | 123.194502 | 70.386040 | 8.506676 | 2.197469 | 2.398940 | 0.930895 | 0.596245 | 83.555279 | 85.973903 |
| 3 | USA | 2018 | 131.423444 | 91.469020 | 3.418054 | 4.537724 | 1.608407 | -1.479587 | 2.321099 | 88.968961 | 134.671788 |
| 4 | USA | 2019 | 110.461377 | 56.837048 | 9.158097 | 3.700762 | 1.293249 | 1.961415 | -0.879640 | 87.279612 | 90.702399 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | UAE | 2020 | 160.153711 | 117.170365 | 7.390686 | 2.048681 | 4.061737 | 1.635019 | 1.955250 | 62.942477 | 109.059291 |
| 196 | UAE | 2021 | 95.004866 | 72.727315 | 9.636214 | 3.880091 | 2.943274 | 1.155810 | 0.608064 | 83.550496 | 141.366775 |
| 197 | UAE | 2022 | 175.533469 | 105.083237 | 5.838662 | 4.940674 | 3.344420 | 0.037365 | 1.943051 | 61.152790 | 142.140959 |
| 198 | UAE | 2023 | 126.147746 | 94.604103 | 8.934185 | 5.975589 | 4.638353 | 2.911469 | -0.766718 | 75.552241 | 82.013520 |
| 199 | UAE | 2024 | 153.743377 | 85.855534 | 9.122050 | 1.708365 | 1.051148 | 3.730587 | -0.747705 | 62.137702 | 70.968678 |
200 rows × 11 columns
# Duplicated rows
print(f"The number of duplicated rows = {df.duplicated().sum()}")
The number of duplicated rows = 0
There are no duplicated rows. Thus, we don't have to worry about this during data analysis.
Next, we are checking whether there are any invalid values. For this, we check whether certain columns have negative values when they must be positive. We skip these columns:
- GDP Growth (%): Can be negative during a recession.
- Inflation Rate (%): Can be negative during deflation.
- Population Growth (%): Can be negative if people leave the country or when the birth rate is lower than the death rate.
# List of columns that must be positive
valid_columns = [
'Price_Index',
'Rent_Index',
'Affordability_Ratio',
'Mortgage_Rate',
'Urbanization_Rate',
'Construction_Index'
]
# Looping through the list to find and remove negative values
for col in valid_columns:
# Counting negatives
neg_count = (df_new[col] < 0).sum()
print(f"{col}: Found {neg_count} negative values.")
# Removing rows with negative values in this column
df_new = df_new[df_new[col] >= 0]
# Checking final shape
print(f"\nRows remaining after cleaning: {len(df_new)}")
Price_Index: Found 0 negative values. Rent_Index: Found 0 negative values. Affordability_Ratio: Found 0 negative values. Mortgage_Rate: Found 0 negative values. Urbanization_Rate: Found 0 negative values. Construction_Index: Found 0 negative values. Rows remaining after cleaning: 200
We are not converting 'Year' column to categorical just yet. It would be better for later visualizations if it remains numerical. For the Line Plots later, Python prefers Year to be a number so it can draw a continuous line. If we convert it to a category (string), the X-axis might behave strangely on line charts. We can treat it as a category specifically when making Bar Charts later if needed.
Additional Cleaning Steps¶
# Adding a new column 'Price_to_Rent_Ratio'
df_new['Price_to_Rent_Ratio'] = df_new['Price_Index'] / df_new['Rent_Index']
df_new
| Country | Year | Price_Index | Rent_Index | Affordability_Ratio | Mortgage_Rate | Inflation | GDP_Growth | Pop_Growth | Urbanization_Rate | Construction_Index | Price_to_Rent_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | 2015 | 117.454012 | 116.550001 | 9.587945 | 4.493292 | 1.514121 | -0.752044 | -0.796707 | 85.985284 | 118.089201 | 1.007756 |
| 1 | USA | 2016 | 150.807258 | 51.440915 | 11.729189 | 5.662213 | 1.880204 | -0.545400 | -0.358084 | 69.127267 | 111.980515 | 2.931660 |
| 2 | USA | 2017 | 123.194502 | 70.386040 | 8.506676 | 2.197469 | 2.398940 | 0.930895 | 0.596245 | 83.555279 | 85.973903 | 1.750269 |
| 3 | USA | 2018 | 131.423444 | 91.469020 | 3.418054 | 4.537724 | 1.608407 | -1.479587 | 2.321099 | 88.968961 | 134.671788 | 1.436808 |
| 4 | USA | 2019 | 110.461377 | 56.837048 | 9.158097 | 3.700762 | 1.293249 | 1.961415 | -0.879640 | 87.279612 | 90.702399 | 1.943475 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | UAE | 2020 | 160.153711 | 117.170365 | 7.390686 | 2.048681 | 4.061737 | 1.635019 | 1.955250 | 62.942477 | 109.059291 | 1.366845 |
| 196 | UAE | 2021 | 95.004866 | 72.727315 | 9.636214 | 3.880091 | 2.943274 | 1.155810 | 0.608064 | 83.550496 | 141.366775 | 1.306316 |
| 197 | UAE | 2022 | 175.533469 | 105.083237 | 5.838662 | 4.940674 | 3.344420 | 0.037365 | 1.943051 | 61.152790 | 142.140959 | 1.670423 |
| 198 | UAE | 2023 | 126.147746 | 94.604103 | 8.934185 | 5.975589 | 4.638353 | 2.911469 | -0.766718 | 75.552241 | 82.013520 | 1.333428 |
| 199 | UAE | 2024 | 153.743377 | 85.855534 | 9.122050 | 1.708365 | 1.051148 | 3.730587 | -0.747705 | 62.137702 | 70.968678 | 1.790722 |
200 rows × 12 columns
We calculate 'Price_to_Rent_Ratio' to see if it's better to buy or rent. A high ratio suggests housing prices are growing faster than rent (potentially a bubble).
Exploratory Data Analysis¶
Structure and Summary Statistics¶
print("Dataset Structure:")
print(df_new.info())
Dataset Structure: <class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Country 200 non-null object 1 Year 200 non-null int64 2 Price_Index 200 non-null float64 3 Rent_Index 200 non-null float64 4 Affordability_Ratio 200 non-null float64 5 Mortgage_Rate 200 non-null float64 6 Inflation 200 non-null float64 7 GDP_Growth 200 non-null float64 8 Pop_Growth 200 non-null float64 9 Urbanization_Rate 200 non-null float64 10 Construction_Index 200 non-null float64 11 Price_to_Rent_Ratio 200 non-null float64 dtypes: float64(10), int64(1), object(1) memory usage: 18.9+ KB None
- The dataset contains 200 rows and 12 columns.
print("Summary Statistics (All Numeric Cols):")
display(df_new.describe())
Summary Statistics (All Numeric Cols):
| Year | Price_Index | Rent_Index | Affordability_Ratio | Mortgage_Rate | Inflation | GDP_Growth | Pop_Growth | Urbanization_Rate | Construction_Index | Price_to_Rent_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
| mean | 2019.500000 | 130.381022 | 83.048370 | 7.237768 | 4.150621 | 3.649756 | 2.133722 | 0.722806 | 74.769891 | 111.204257 | 1.672625 |
| std | 2.879489 | 28.752229 | 21.439858 | 2.576085 | 1.380222 | 1.881938 | 2.413270 | 1.035691 | 8.734246 | 24.309473 | 0.554341 |
| min | 2015.000000 | 80.552212 | 50.354311 | 3.041688 | 1.537814 | 0.532110 | -1.921833 | -0.961391 | 60.172760 | 70.968678 | 0.717881 |
| 25% | 2017.000000 | 104.142562 | 60.466671 | 5.034207 | 3.045278 | 1.939199 | -0.095628 | -0.183320 | 66.923842 | 90.182198 | 1.260941 |
| 50% | 2019.500000 | 129.193653 | 83.721711 | 7.375697 | 4.329643 | 3.664625 | 2.307548 | 0.722372 | 75.100060 | 110.593807 | 1.625208 |
| 75% | 2022.000000 | 157.127098 | 100.604665 | 9.276196 | 5.217706 | 5.261720 | 4.272776 | 1.621317 | 82.677430 | 133.780445 | 1.944121 |
| max | 2024.000000 | 179.971767 | 119.855388 | 11.879671 | 6.485623 | 6.912349 | 5.958931 | 2.497948 | 89.788944 | 149.735748 | 3.423656 |
Count of Countries and Year Range¶
# count of countires
unique_countries = df_new['Country'].unique()
print(f"Total Number of Countries: {len(unique_countries)}")
print(f"List of Countries: {unique_countries}")
# year range
unique_years = df_new['Year'].unique()
print(f"\nTotal Number of Years: {len(unique_years)}")
print(f"List of Years: {unique_years}")
print(f"\nYear Range: {df_new['Year'].min()} to {df_new['Year'].max()}")
Total Number of Countries: 20 List of Countries: ['USA' 'Canada' 'UK' 'Germany' 'France' 'Italy' 'Spain' 'Australia' 'India' 'China' 'Japan' 'Brazil' 'South Korea' 'Mexico' 'South Africa' 'Russia' 'Netherlands' 'Sweden' 'Switzerland' 'UAE'] Total Number of Years: 10 List of Years: [2015 2016 2017 2018 2019 2020 2021 2022 2023 2024] Year Range: 2015 to 2024
- The data covers a 10-year period from 2015 to 2024.
- There are 20 unique countries included in the analysis, including major economies like USA, China, India, and Germany.
Summary of Affordability and Housing Indicators¶
housing_cols = ['Price_Index', 'Rent_Index', 'Affordability_Ratio', 'Mortgage_Rate']
print("Summary of Key Housing Indicators:")
display(df_new[housing_cols].describe())
Summary of Key Housing Indicators:
| Price_Index | Rent_Index | Affordability_Ratio | Mortgage_Rate | |
|---|---|---|---|---|
| count | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
| mean | 130.381022 | 83.048370 | 7.237768 | 4.150621 |
| std | 28.752229 | 21.439858 | 2.576085 | 1.380222 |
| min | 80.552212 | 50.354311 | 3.041688 | 1.537814 |
| 25% | 104.142562 | 60.466671 | 5.034207 | 3.045278 |
| 50% | 129.193653 | 83.721711 | 7.375697 | 4.329643 |
| 75% | 157.127098 | 100.604665 | 9.276196 | 5.217706 |
| max | 179.971767 | 119.855388 | 11.879671 | 6.485623 |
- House Price Index (Price_Index)
- Average: The mean price index is 130.38, indicating a general increase in housing prices compared to the base baseline (usually 100).
- Range: Prices vary significantly, ranging from a low of 80.55 to a high of 179.97. This suggests distinct periods or regions of housing booms versus market corrections.
- Rent Index (Rent_Index)
- Average: The mean rent index is 83.05.
- Comparison: The Rent Index average (83.05) is significantly lower than the House Price Index average (130.38). This gap suggests that, globally, property purchase prices have risen much faster than rental costs, which may drive more people toward renting.
- Affordability Ratio (Affordability_Ratio)
- Average: The global average ratio is 7.24.
- Range: The ratio ranges from 3.04 (highly affordable) to 11.88 (severely unaffordable). The standard deviation of 2.58 indicates that housing affordability is highly unequal across different countries or years.
- Mortgage Rate (Mortgage_Rate)
- Average: The average mortgage rate over this period is 4.15%.
- Range: Rates have fluctuated between a minimum of 1.54% (likely during aggressive monetary easing) and a maximum of 6.49%, reflecting changing global economic conditions and inflation responses.
Identify Temporal (Time) or Regional (Country) Anomalies¶
# We group by Year to see if any specific year stands out (e.g., 2020 Covid shock)
print("Average Price Index by Year (Temporal Trends):")
yearly_stats = df_new.groupby('Year')['Price_Index'].mean()
print(yearly_stats)
Average Price Index by Year (Temporal Trends): Year 2015 139.227551 2016 120.354279 2017 130.374609 2018 127.793459 2019 123.056015 2020 132.257842 2021 142.144052 2022 120.314413 2023 130.116069 2024 138.171933 Name: Price_Index, dtype: float64
# We find the Country with the highest and lowest Affordability on average
country_affordability = df_new.groupby('Country')['Affordability_Ratio'].mean().sort_values()
print(country_affordability)
print("\nMost Affordable Country (Average):")
print(country_affordability.head(1))
print("\nLeast Affordable Country (Average):")
print(country_affordability.tail(1))
Country Brazil 5.822374 Germany 6.101326 France 6.114604 China 6.122026 Japan 6.217082 Australia 6.359994 Mexico 6.399821 Canada 6.694199 USA 7.469501 Italy 7.471097 South Korea 7.540478 Spain 7.716534 UAE 7.736717 Netherlands 7.747943 Switzerland 7.772481 India 7.951057 UK 8.007954 Sweden 8.046817 South Africa 8.283574 Russia 9.179786 Name: Affordability_Ratio, dtype: float64 Most Affordable Country (Average): Country Brazil 5.822374 Name: Affordability_Ratio, dtype: float64 Least Affordable Country (Average): Country Russia 9.179786 Name: Affordability_Ratio, dtype: float64
Next, we check for specific "Anomaly" rows (e.g., extremely high Inflation). Let's define an anomaly as Inflation > 10% or Mortgage Rate > 10%.
# anamolies rows
anomalies = df_new[(df_new['Inflation'] > 10) | (df_new['Mortgage_Rate'] > 10)]
print(f"\nNumber of Economic Anomalies (High Inflation/Rates): {len(anomalies)}")
display(anomalies.head())
Number of Economic Anomalies (High Inflation/Rates): 0
| Country | Year | Price_Index | Rent_Index | Affordability_Ratio | Mortgage_Rate | Inflation | GDP_Growth | Pop_Growth | Urbanization_Rate | Construction_Index | Price_to_Rent_Ratio |
|---|
- Temporal Trends (Yearly Price Fluctuations)
- Volatility: The data shows that the Global House Price Index did not grow linearly. It fluctuated significantly over the decade.
- The 2021 Peak: The highest average Price Index occurred in 2021 (142.14). This aligns with global trends reflecting the post-pandemic housing boom driven by low interest rates and high demand.
- The 2022 Correction: Immediately following the peak, there was a sharp drop in 2022 (120.31), likely reflecting market corrections or the onset of rising interest rates to combat inflation. Recovery: By 2024, prices began to recover significantly (138.17), approaching 2021 levels again.
- Regional Trends (Affordability by Country)
- Most Affordable: Brazil appears to be the most affordable housing market in this dataset, with an average Affordability Ratio of 5.82.
- Least Affordable: Russia faces the biggest affordability challenge, with the highest average ratio of 9.18.
- Observation: Major western economies like the UK (8.01) and Sweden (8.05) also rank on the higher end of the un-affordability spectrum, while countries like Germany (6.10) and France (6.11) remained relatively more affordable during this period.
- Economic Anomalies
- Stability Check: We defined an "economic anomaly" as periods where Inflation or Mortgage Rates exceeded 10%.
- Result: The analysis found 0 rows meeting this criteria.
- Conclusion: This suggests that for the countries included in this dataset (2015-2024), macroeconomic conditions regarding interest rates and inflation remained within a standard range (below 10%), avoiding hyperinflation or extreme usury scenarios.
Summary of Data Preparation & Initial Analysis¶
- Data Integrity: We successfully loaded the dataset covering 20 countries from 2015–2024. A rigorous inspection confirmed the data is clean, containing 0 missing values and 0 duplicate rows.
- Preprocessing: We standardized column names (e.g., changing 'House Price Index' to 'Price_Index') to improve code readability and validated the data to ensure no impossible negative values existed in key metrics like mortgage rates.
- Feature Engineering: We derived a new feature, Price_to_Rent_Ratio, to enable deeper comparisons between the cost of buying versus renting.
- Key Insights:
- Temporal: Global housing prices peaked in 2021, followed by a correction in 2022.
- Regional: Brazil was identified as the most affordable market (ratio ~5.8), while Russia was the least affordable (ratio ~9.2).
- Stability: No extreme economic anomalies (Inflation or Rates >10%) were found, indicating a relatively stable macroeconomic environment in the dataset.
Data Visualization and Interpretation¶
Visualization 1: Correlation Heatmap¶
Purpose: Understand relationships between variables (e.g., Do high mortgage rates correlate with low house prices?).
# Select only numeric columns for correlation
numeric_cols = df_new.select_dtypes(include=['float64', 'int64'])
corr_matrix = numeric_cols.corr()
corr_matrix
| Year | Price_Index | Rent_Index | Affordability_Ratio | Mortgage_Rate | Inflation | GDP_Growth | Pop_Growth | Urbanization_Rate | Construction_Index | Price_to_Rent_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1.000000 | 0.036894 | -0.128500 | -0.036467 | -0.067791 | -0.017648 | 0.001367 | 0.105136 | -0.096955 | -0.009809 | 0.115638 |
| Price_Index | 0.036894 | 1.000000 | 0.098280 | -0.132869 | 0.069741 | 0.050327 | -0.080581 | 0.076943 | -0.005663 | -0.108723 | 0.575846 |
| Rent_Index | -0.128500 | 0.098280 | 1.000000 | -0.053381 | -0.000694 | 0.020875 | -0.098396 | 0.008684 | -0.022523 | -0.026966 | -0.721132 |
| Affordability_Ratio | -0.036467 | -0.132869 | -0.053381 | 1.000000 | -0.010473 | -0.003447 | -0.036567 | -0.034430 | -0.180839 | -0.045611 | -0.051140 |
| Mortgage_Rate | -0.067791 | 0.069741 | -0.000694 | -0.010473 | 1.000000 | -0.026490 | 0.121195 | -0.066062 | 0.025087 | 0.004873 | 0.065109 |
| Inflation | -0.017648 | 0.050327 | 0.020875 | -0.003447 | -0.026490 | 1.000000 | 0.080694 | 0.119187 | -0.081713 | 0.001767 | 0.041282 |
| GDP_Growth | 0.001367 | -0.080581 | -0.098396 | -0.036567 | 0.121195 | 0.080694 | 1.000000 | -0.071452 | -0.004604 | -0.095127 | 0.051115 |
| Pop_Growth | 0.105136 | 0.076943 | 0.008684 | -0.034430 | -0.066062 | 0.119187 | -0.071452 | 1.000000 | -0.044801 | 0.065321 | 0.043891 |
| Urbanization_Rate | -0.096955 | -0.005663 | -0.022523 | -0.180839 | 0.025087 | -0.081713 | -0.004604 | -0.044801 | 1.000000 | -0.021579 | 0.026507 |
| Construction_Index | -0.009809 | -0.108723 | -0.026966 | -0.045611 | 0.004873 | 0.001767 | -0.095127 | 0.065321 | -0.021579 | 1.000000 | -0.067419 |
| Price_to_Rent_Ratio | 0.115638 | 0.575846 | -0.721132 | -0.051140 | 0.065109 | 0.041282 | 0.051115 | 0.043891 | 0.026507 | -0.067419 | 1.000000 |
# Correlation Heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', fmt=".2f", linewidths=0.5, vmin=-1, vmax=1)
plt.title('Correlation Heatmap of Housing & Economic Indicators')
plt.savefig('Heatmap.png')
plt.show()
Interpretation:¶
- The "Decoupling" of Prices and Rent (Anomaly)
- Observation: There is an extremely weak correlation between Price_Index and Rent_Index (0.098).
- Insight: Conventionally, house prices and rents move together. The fact that they are nearly uncorrelated in this dataset suggests a market dislocation where speculative buying might be driving prices up independent of the actual rental value of the properties.
- Mortgage Rates vs. Prices (Economic Anomaly)
- Observation: The correlation between Mortgage_Rate and Price_Index is slightly positive (0.07).
- Insight: Economic theory suggests a negative correlation (higher rates usually cool down prices). The lack of a negative correlation here indicates that global housing prices remained resilient even as interest rates fluctuated, or that supply shortages kept prices high despite borrowing costs.
- Dynamics of the Price-to-Rent Ratio
- Observation: The derived Price_to_Rent_Ratio shows a strong negative correlation with Rent_Index (-0.72) and a moderate positive correlation with Price_Index (0.58).
- Insight: This confirms that changes in affordability (buy vs. rent) in this decade were driven more by fluctuating rental markets than by house prices alone.
- General Weakness of Correlations
- Observation: Most other economic indicators (GDP, Inflation, Urbanization) show correlations near zero with housing metrics.
- Insight: This implies that global "macro" trends are not the sole driver of housing markets. Localized factors (specific to each country) likely play a much larger role than global averages
Visualization 2: Average House Price Index by Country (Bar Plot)¶
Purpose: Compare the most expensive/inflated markets against the stable ones.
avg_price = df_new.groupby('Country')['Price_Index'].mean().sort_values(ascending=False).reset_index()
plt.figure(figsize=(14, 7))
sns.barplot(data=avg_price, x='Price_Index', y='Country', palette='viridis', hue='Country', legend=False)
plt.title('Average House Price Index by Country (2015-2024)')
plt.xlabel('Average Price Index')
plt.ylabel('Country')
plt.axvline(x=100, color='red', linestyle='--', label='Baseline (100)')
plt.legend()
plt.savefig("Price_Index.png")
plt.show()
Interpretation¶
- Top Markets: Switzerland exhibits the highest average House Price Index over the decade (approaching ~145), followed closely by Brazil and the Netherlands. This suggests these countries experienced the most sustained upward pressure on housing prices relative to the baseline.
- Universal Inflation: Crucially, every single country on this list has an average index above the red baseline of 100. This confirms a universal global trend: housing assets appreciated in nominal terms across all 20 economies studied between 2015 and 2024.
- The "Low" End: Canada, South Africa, and Sweden appear at the bottom of this specific ranking. However, even the lowest country (Canada) still sits well above the 100 baseline (approx. 112), showing that even the "cooler" markets in this dataset still saw growth.
Visualization 3: Mortgage Rate Distribution (Histogram)¶
Purpose: Understand the spread of interest rates globally.
# histogram
plt.figure(figsize=(12, 6))
sns.histplot(df_new['Mortgage_Rate'], bins=20, kde=True, color='teal', edgecolor='black')
plt.title('Global Distribution of Mortgage Rates (2015-2024)')
plt.xlabel('Mortgage Rate (%)')
plt.ylabel('Frequency (Count of Records)')
plt.savefig('Mortgage_Rate.png')
plt.show()
Interpretation¶
- Broad Spread: The mortgage rates in this dataset are widely distributed, ranging from approximately 1.5% to 6.5%. This reflects the diversity of economic policies across the 20 different countries.
- The "Normal" Range: The tallest bars (highest frequency) appear in the 4.0% to 5.5% range. This suggests that for most of the last decade, a mortgage rate of around 4.5% was the global standard.
- Clusters: There is a cluster of low rates (1.5% - 2.5%), likely driven by countries with historically low interest rate policies (like Japan or Switzerland) or the global rate cuts during the pandemic. There is a cluster of moderate rates (4.0% - 6.0%), likely representing the standard operating economies (like the USA or UK) during non-crisis years.
- No Extremes: Consistent with our cleaning phase, there are no rates above 7% or below 0%, confirming the stability of the dataset.
Visualization 4: House Price vs Rent Index (Scatter Plot)¶
Purpose: Explore "Buy vs Rent" dynamics. If prices are high but rent is low, it suggests a market bubble.
# scatterplot
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df_new, x='Rent_Index', y='Price_Index',
hue='Country', size='Affordability_Ratio', sizes=(20, 200), alpha=0.7)
# Add a diagonal line (x=y) to show equilibrium
plt.plot([50, 180], [50, 180], color='red', linestyle='--', label='Equal Price/Rent Growth')
plt.title('House Price Index vs Rent Index')
plt.xlabel('Rent Index')
plt.ylabel('House Price Index')
plt.legend(title='Country')
plt.savefig('Buy_vs_Rent.png')
plt.show()
Interpretation¶
- The "Buy vs. Rent" Gap
- Dominant Trend: The most significant observation is that most of the data points lie above the red dashed line.
- Meaning: This indicates that globally, House Prices (Y-axis) are consistently higher than Rent Prices (X-axis) relative to their baselines. In other words, the cost of purchasing a home has appreciated much faster than the cost of renting one.
- Potential "Asset Bubble": The cluster of points in the top-left quadrant (High Price, Low Rent) represents markets that might be overheated. In these scenarios, speculative buying drives up asset prices even though the actual rental yield (income) hasn't kept pace.
- Color clusters (Countries): Colors represent different countries. There is no strong clustering by country, meaning housing markets behave very differently even among similar economies. Some developed countries (like Western Europe) tend to lie higher on the y-axis, but overall the scatter is very mixed.
- The bubble sizes (Affordability Ratios): The bubble size represents the affordability ratio. Here, Larger bubbles meanless affordable housing. Large bubbles appear across various rent levels and across various house price levels. This shows that affordability problems exist at all points—not only where house prices are extremely high.
Visualization 5: Affordability Ratio Over Time (Line Plot)¶
Purpose: Track the global and regional trends in how hard it is to buy a home.
plt.figure(figsize=(14, 8))
sns.lineplot(data=df_new, x='Year', y='Affordability_Ratio', hue='Country', marker='o', linewidth=2)
plt.title('Trends in Housing Affordability Ratio (2015-2024)')
plt.ylabel('Affordability Ratio (Lower is Better)')
plt.xlabel('Year')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle='--', alpha=0.6)
plt.savefig('Affordability_Ratio.png', bbox_inches='tight')
plt.show()
divide these into 4 different groups based on affordability ratio and get 4 separate graphs. you can use plotly
Interpretation¶
- High Volatility (The "Zig-Zag" Effect) Observation: The most obvious feature of this chart is the extreme volatility. Almost no country follows a smooth, predictable path. Lines spike and dip sharply year-over-year. Insight: This suggests that housing affordability is highly sensitive to short-term economic shocks—such as sudden interest rate hikes, inflation spikes, or GDP contractions—rather than being a slow-moving, stable metric.
- The 2023-2024 Outlook Observation: Looking at the far right of the chart (2023-2024), we see a wide dispersion. While some countries have seen affordability improve (lines going down), several others are spiking toward the 10-12 range. Insight: This indicates a divergence in the global recovery. Some nations have successfully cooled their markets, while others are facing worsening affordability crises.
- No "Safe Haven" Observation: Even countries that appear at the bottom of the chart (most affordable) in one year often spike up in the next. Insight: There is no single country that remained consistently "highly affordable" (below a ratio of 4) throughout the entire decade. The housing cost burden is a shared global challenge that rotates between regions.
The chart is quite busy. It might be helpful to separate the plot into sub-plots.
Visualization 6: Geo map¶
Purpose: Visualize the data on a world map (we will use plotly for this).
# Preparing the data for the map
df_geo = df_new.copy()
# Renaming specific countries so Plotly recognizes them
df_geo['Country'] = df_geo['Country'].replace({
'UK': 'United Kingdom',
'USA': 'United States',
'UAE': 'United Arab Emirates'
})
# Creating Plotly Choropleth animation
fig = px.choropleth(df_geo,
locations='Country', # Using the Country Name column
locationmode='country names', # Tells Plotly these are names, not ISO codes
color='Price_Index', # The data we want to visualize
hover_name='Country',
animation_frame='Year', # Creating the play button/slider
projection='natural earth',
title='Global House Price Index Evolution (2015-2024)',
color_continuous_scale='Plasma',
range_color=[80, 180], # Fixing the color scale so it doesn't flicker between years
height=600)
fig.write_html("geo_map.html")
fig.show()
Interpretation¶
- Volatility
- Observation: Unlike stable economic metrics that change slowly, this map shifts colors rapidly.
- Example (Australia): When we look at 2016 (Purple/Low) vs 2020 (Bright Yellow/High).
- Example (Russia): It shifts from Orange (High) in 2015, drops to Purple (Low) in 2017, and heats up again in 2020.
- Insight: This visualizes the "boom and bust" cycles inherent in the global housing market over this decade.
- The 2021 Global Peak In the 2021 snapshot, we see a wave of warmer colors (Oranges and Yellows) spreading across different parts of the globe. This aligns with the post-pandemic global housing boom.
- The 2022 Correction & 2024 Resurgence
- Correction: In 2022, the map noticeably "cools down" (more purple appears), likely due to economic sanctions or market corrections.
- Resurgence: By 2024, the map heats up again. China turns bright yellow (Index ~165) and the UK/Europe return to high price index levels, suggesting a return to inflation in the housing sector.
- Regional Contrast The map highlights that housing inflation is not synchronized. In 2019, while Canada was "hot" (Yellow), China was "cool" (Purple). This reinforces the finding that local policies and economies drive housing prices more than global trends.
Optional Research Questions (The Extra 3 Visuals)¶
Question 1: Which countries faced the most extreme single-year price spikes?¶
To identify the countries with the most volatile housing markets, we calculate the year-over-year price changes for each country and extract the maximum single-year increase. This helps us understand which markets experienced the most dramatic price surges. The results are visualized using a bar plot for easy comparison.
# Sorting data by Country and Year to ensure calculation is correct
df_sorted = df_new.sort_values(['Country', 'Year'])
# Calculating the difference from the previous year
# group.diff() subtracts the current row from the previous row
df_sorted['Yearly_Jump'] = df_sorted.groupby('Country')['Price_Index'].diff()
# Finding the Maximum Jump for each country
# We look for the highest positive number (biggest increase)
max_spikes = df_sorted.groupby('Country')['Yearly_Jump'].max().sort_values(ascending=False).reset_index()
# Plotting using Plotly
fig = px.bar(max_spikes,
x='Country',
y='Yearly_Jump',
color='Yearly_Jump',
color_continuous_scale='Reds',
text_auto='.1f',
title='Largest Single-Year House Price Spike (Volatility Check)')
fig.update_layout(
xaxis_title="Country",
yaxis_title="Max Index Points Gained in 1 Year",
height=600
)
fig.write_html("Price_Spikes.html")
fig.show()
Interpretation¶
- The Volatility Leaders (Sweden & China)
- Observation: Sweden experienced the most extreme single-year shock in the entire dataset, with its index jumping by 89.2 points in just 12 months. China followed closely with an 83.6 point spike.
- Insight: Such massive increases (nearly doubling the index value in one year) are indicative of extreme market overheating, likely driven by speculative bubbles or aggressive monetary stimulus (cheap debt) that flooded the housing market.
- Widespread Instability
- Observation: The chart shows that volatility was not isolated. 13 out of 20 countries saw a single-year spike of over 60 points.
- Insight: This confirms that the "housing boom" was a synchronized global event. It wasn't just one country; the majority of the world's major economies faced a massive, sudden repricing of housing assets.
- The "Stable" Outliers
- Observation: India, Brazil, and Italy occupy the lower end of the volatility spectrum, with max spikes around 27-33 points.
- Insight: While still significant, these markets were far less prone to the "mania". Their growth was likely more gradual rather than explosive.
Question 2: Does "More Construction" lead to "Lower Prices"?¶
Economic theory suggests high supply (Construction) should lower prices. We evaluate this by looking at the relationship between consturction and price index. A negative correlation would support the theory, while a positive or weak correlation might suggest other factors are at play.
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_new, x='Construction_Index', y='Price_Index',
hue='Country', size='GDP_Growth', sizes=(20, 200), alpha=0.7)
plt.title('Does Construction Activity Lower Prices? (Supply vs Demand)')
plt.xlabel('Construction Index (Higher = More Building)')
plt.ylabel('House Price Index')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle='--', alpha=0.5)
plt.savefig('Construction_vs_Price.png', bbox_inches='tight')
plt.show()
Interpretation¶
- The "Random Scatter" (No Correlation)
- Observation: The data points are scattered randomly across the chart like a cloud. There is no clear trend line going up or down. Insight: This contradicts the basic economic assumption that "More Supply leads to Lower Prices." In this global dataset, countries with High Construction Activity (Right side, X > 130) often have just as high prices as countries with Low Construction Activity.
- Supply Chasing Demand
- Theory: We expected a downward slope (building more homes should cool the market).
- Reality: The lack of a downward slope suggests a reactive market. Developers likely build more when prices are high to maximize profits. Therefore, high construction numbers often coincide with high prices, rather than lowering them immediately.
- GDP Influence
- Observation: The size of the bubbles (GDP Growth) varies across the board.
- Insight: We see large bubbles (High GDP Growth) at both high and low price points. This indicates that economic growth alone isn't the sole driver of housing inflation.
Question 3: How does Urbanization affect Affordability?¶
As populations shift toward urban areas, housing demand in cities increases. We investigate how urbanization rates impact housing affordability by analyzing the relationship between the percentage of people living in urban areas and key affordability metrics like the price-to-income ratio.
plt.figure(figsize=(10, 6))
# regplot (for scatter plot and trend line)
sns.regplot(data=df_new, x='Urbanization_Rate', y='Affordability_Ratio',
color='purple', line_kws={'color': 'orange'})
plt.title('Impact of Urbanization on Affordability')
plt.xlabel('Urbanization Rate (%)')
plt.ylabel('Affordability Ratio (Higher = Worse)')
plt.savefig('Urbanization_vs_Affordability.png')
plt.show()
# Simple correlation number
corr = df_new['Urbanization_Rate'].corr(df_new['Affordability_Ratio'])
print(f"Correlation: {corr:.2f}")
Correlation: -0.18
Interpretation¶
- The "Wealth Effect" (Negative Correlation)
- Observation: The regression line slopes downward, and the correlation is -0.18. This indicates a weak negative relationship: as Urbanization increases, the Affordability Ratio tends to decrease (meaning housing becomes more affordable).
- Insight: We usually assume big cities are expensive. However, highly urbanized nations (like Germany or Japan) typically have higher average incomes. While the price tag of a home in a city is high, the salaries are also higher. In contrast, less urbanized developing nations might have lower house prices but significantly lower wages, resulting in a "worse" (higher) affordability ratio.
- Weak Predictive Power
- Observation: The correlation is very close to zero (-0.18), and the dots are scattered widely around the line.
- Insight: Urbanization alone is not a strong predictor of housing affordability. We can find "unaffordable" markets (high Y-axis values) in both rural-heavy and city-heavy countries. Other factors like interest rates and government policy clearly matter more than just the population density.
Conclusion¶
Summary of Findings¶
This analysis of the Global Housing Market (2015-2024) revealed a decade of extreme volatility and a widening gap between housing costs and average incomes.
- Global Boom & Volatility: The market did not grow linearly. We observed a synchronized global price spike in 2021 followed by a correction in 2022. The "Single-Year Spike" analysis showed that countries like Sweden and China experienced shock events where the index nearly doubled in a single year.
- The Buy-vs-Rent Disconnect: Our scatter plot analysis highlighted a potential bubble. In almost every country, the Price Index significantly outpaced the Rent Index, suggesting that speculative asset appreciation, rather than fundamental rental value, drove prices up.
- Affordability Crisis: While some regions (Brazil, Germany) remained relatively stable, others (Russia, South Africa) saw affordability ratios climb to critical levels.
- Supply vs. Price: We found no correlation between construction activity and lower prices, indicating that increasing supply alone was insufficient to cool down these heated markets.
Final Thoughts¶
The data suggests that the "Global Housing Rollercoaster" was driven less by standard supply/demand mechanics and more by macroeconomic shifts (likely interest rates and pandemic-era stimulus). As we move into 2024, the market shows signs of heating up again, with high volatility remaining a key risk for the future.
Exporting the cleaned csv¶
df_new.to_csv('Anuj_Subedi_final_cleaned.csv', index=False)